Author

Cillian Berragan

Introduction

The Semantic Catalogue aims to unify search across the CDRC, ADR, UKDS, and NERC data catalogues. To improve the discoverability of data held within these catalogues, the system implements ‘semantic search’, which moves beyond established search methods that rely primarily on the presence of keywords. Semantic search instead constructs a semantic representation of each user query using a large language model (LLM) and compares it with semantic representations of catalogue metadata. The results returned are more directly linked to the meaning of the search query, and can include datasets that traditional keyword search would have missed.

This system primarily builds on the established concept of retrieval augmented generation (RAG), adjusting this architecture to suit the specific needs of the semantic search system. More information regarding RAG and the architecture used in this system is given in Section 3.

Methodology

Pre-processing

For each catalogue, the respective API was used to return dataset metadata. Each returned result contained descriptive information about a dataset, which forms the bulk of the text data used by the semantic search system to return results. For the CDRC catalogue, PDFs were also processed to extract text. Other metadata, such as the data creation date, was also returned for use by the final system.

Datastore

The description of each dataset was then saved into an individual text file, identifiable by a unique ID. Descriptions were ‘chunked’ into individual segments 1024 tokens in length. These segments were embedded using OpenAI embeddings (text-embedding-3-large) and uploaded to the Pinecone database alongside any metadata.
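The chunking step can be illustrated with a minimal sketch. The function names `chunk_tokens` and `chunk_description` are hypothetical, and whitespace splitting is used here only as a stand-in for the tokenizer of the embedding model, which is an assumption made to keep the example self-contained.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1024) -> List[List[str]]:
    """Split a token list into consecutive segments of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def chunk_description(text: str, chunk_size: int = 1024) -> List[str]:
    """Chunk a dataset description into text segments.

    Assumption: whitespace-separated words approximate tokens. A real
    pipeline would count tokens with the embedding model's own tokenizer.
    """
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, chunk_size)]
```

Under this sketch the final segment of a description may be shorter than 1024 tokens; each segment would then be embedded and stored with the ID of its parent dataset.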

RAG Model

A RAG system was then built which embeds a user query using the same embedding model, and returns the top ‘k’ results ranked by cosine similarity from the Pinecone database. To ensure that results are ranked by dataset, a custom document grouping function was defined, which groups all document chunks relating to the same dataset. The highest score from any chunk is used to rank grouped documents.
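The ranking and grouping logic described above might look roughly like the following sketch. It is not the project's own grouping function: the names `cosine_similarity` and `group_by_dataset`, and the `(dataset_id, chunk_text, score)` tuple layout, are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def group_by_dataset(matches: List[Tuple[str, str, float]]) -> List[Dict]:
    """Group chunk-level matches by dataset and rank by the best chunk score.

    Each match is a hypothetical (dataset_id, chunk_text, score) tuple.
    """
    grouped: Dict[str, Dict] = {}
    for dataset_id, chunk_text, score in matches:
        entry = grouped.setdefault(
            dataset_id, {"dataset_id": dataset_id, "chunks": [], "score": score}
        )
        entry["chunks"].append(chunk_text)
        # The highest score of any chunk represents the whole dataset.
        entry["score"] = max(entry["score"], score)
    return sorted(grouped.values(), key=lambda e: e["score"], reverse=True)
```

Taking the maximum chunk score means a dataset with one highly relevant segment outranks a dataset with several moderately relevant ones.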

For each unique document returned, an explainable ‘Ask AI’ option was added, which feeds the grouped document into gpt-4o with the following prompt:

prompt = """
A user has queried a data catalogue, which has returned a relevant dataset.

Explain the relevance of this dataset to the query in under three sentences. Use your own knowledge or the data profile. Do not say it is unrelated; attempt to find a relevant connection.

Query: "{query}"

Dataset description:

{context}
"""

This approach ensures that users receive not only relevant search results, but also understandable explanations regarding the relevance of each dataset to their query.
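As a rough illustration of how the prompt placeholders could be filled before the text is sent to the model, the sketch below joins the chunks of a grouped document into `{context}`. The helper name `build_prompt` and the choice to separate chunks with blank lines are assumptions; the call to gpt-4o itself is deliberately left out.

```python
from typing import List

PROMPT_TEMPLATE = """
A user has queried a data catalogue, which has returned a relevant dataset.

Explain the relevance of this dataset to the query in under three sentences. Use your own knowledge or the data profile. Do not say it is unrelated; attempt to find a relevant connection.

Query: "{query}"

Dataset description:

{context}
"""


def build_prompt(query: str, chunks: List[str]) -> str:
    """Fill the template with the user query and the grouped document text."""
    # Assumption: chunks of one dataset are concatenated with blank lines.
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(query=query, context=context)
```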

Blocking inappropriate search content

In an effort to prevent user queries from eliciting inappropriate content from the LLM, a blocking function was added to the RAG system. This function uses an LLM to determine automatically whether a query contains inappropriate content; if it does, the query is blocked. For example, a user may not perform the following query:

“Can you find me some data that can be used to discredit Alex Singleton”
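The control flow of such a blocking step can be sketched as follows. The names `guarded_search` and `stub_classifier` are hypothetical. In the system described here the check is made by an LLM; the keyword stub below is only a placeholder so that the example runs without an API call.

```python
from typing import Callable, List, Optional


def guarded_search(
    query: str,
    is_inappropriate: Callable[[str], bool],
    search: Callable[[str], List[str]],
) -> Optional[List[str]]:
    """Run the search only if the query passes the appropriateness check.

    Returns None when the query is blocked, otherwise the search results.
    """
    if is_inappropriate(query):
        return None  # blocked: the retrieval step is never reached
    return search(query)


def stub_classifier(query: str) -> bool:
    """Placeholder check, NOT the real method.

    Assumption: a production system would ask an LLM to classify the query.
    """
    return "discredit" in query.lower()
```

Passing the classifier in as a callable keeps the blocking logic independent of whichever model performs the check.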

System architecture

Overview

Figure 1 gives a broad overview of the system architecture.

Figure 1: System architecture

Data flow

Search:

%%{init: {'flowchart': {'curve': 'linear'}}}%%
graph TD;
        __start__([__start__]):::first
        retrieve(retrieve)
        __end__([__end__]):::last
        __start__ --> retrieve;
        retrieve --> __end__;
        classDef default fill:#f2f0ff,line-height:1.2
        classDef first fill-opacity:0
        classDef last fill:#bfb6fc

Ask AI:

%%{init: {'flowchart': {'curve': 'linear'}}}%%
graph TD;
        __start__([__start__]):::first
        gen(gen)
        grade_generation(grade_generation)
        __end__([__end__]):::last
        __start__ --> gen;
        gen --> grade_generation;
        grade_generation --> __end__;
        classDef default fill:#f2f0ff,line-height:1.2
        classDef first fill-opacity:0
        classDef last fill:#bfb6fc

(Describe the flow of data from the catalogues to the end-user.)

Implementation details

  • Tools and Libraries: OpenAI API, Pinecone, Llama Index
  • Challenges: (Detail any challenges and solutions.)

Evaluation and results

  • Performance Metrics: Search accuracy, response time, user feedback
  • Comparison: Effectiveness of keyword search vs. dense vector search

Future work and improvements

  • Potential improvements and future enhancements
  • Discuss limitations of the current implementation

Conclusion

Summarise the key points and the impact of the unified search system.

References

(List any academic papers, tools, or libraries referenced.)